library(readr)
library(ggplot2)
library(dplyr)
library(glmnet)

theme_set(theme_minimal())

The randomForest and xgboost packages

We will need two new packages today. You can install them with:

if (!require(xgboost))
{
  install.packages("xgboost")
}
if (!require(randomForest))
{
  install.packages("randomForest")
}

library(xgboost)
library(randomForest)

Simulation with randomForest

I think it is useful to do some simulations with trees to get a better grasp on exactly how they use the data to make predictions. Let’s make a simulated dataset with 10,000 points and a data matrix with just two columns. The two columns are sampled uniformly in \([0,1]\). The output variable y will be 1 when the first variable is larger than the second and 0 otherwise.

n <- 10000
X <- matrix(runif(n * 2), ncol=2)
y <- as.numeric(X[,1] > X[,2])

The code below fits a very simple random forest model. The model has only a single split (maxnodes = 2 gives a single split with two outputs), always considers both variables (so the only randomness is the sample of data used), and uses only one tree. This is basically a decision tree fit on a random subset of the training data. The plot shows the predictions using color and puts a dot on any misclassified data points. You can run this without making any changes:

model <- randomForest(X, factor(y), mtry = 2, maxnodes = 2, ntree = 1)
pred <- predict(model, newdata = X)

df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)

df %>%
  ggplot(aes(x1, x2)) +
    geom_point(aes(color = pred)) +
    geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
    geom_abline(slope = 1, intercept = 0)

Rerun the same block of code a few times. Notice that (1) the split tends to be around 0.5 and (2) sometimes a split on x1 is used and sometimes a split on x2 is used. Make sure you understand why both of these happen.

Now, copy the code above into the block below. Change the maximum number of nodes to three and observe what the output looks like. Run it a couple of times to see the variation in each output. Can you deduce what the decision tree looks like each time?

model <- randomForest(X, factor(y), mtry = 2, maxnodes = 3, ntree = 1)
pred <- predict(model, newdata = X)

df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)

df %>%
  ggplot(aes(x1, x2)) +
    geom_point(aes(color = pred)) +
    geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
    geom_abline(slope = 1, intercept = 0)
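If you want to check your deduction against the fitted tree itself, `randomForest::getTree` prints its split structure. Here is a self-contained sketch (the simulation is regenerated so the chunk runs on its own; `labelVar = TRUE` just makes the output a little more readable):

```r
library(randomForest)

# Regenerate the simulation so this chunk is self-contained
set.seed(1)
n <- 10000
X <- matrix(runif(n * 2), ncol = 2)
y <- as.numeric(X[,1] > X[,2])

model <- randomForest(X, factor(y), mtry = 2, maxnodes = 3, ntree = 1)

# Each row is a node: the split variable, the split point, and (for
# terminal nodes, where status == -1) the predicted class
getTree(model, k = 1, labelVar = TRUE)
```

Comparing the split points in this table with the colored regions in the plot should confirm your guess.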

Copy the code once more. Change maxnodes to 5 and observe how the tree tries to estimate the line with slope one and intercept zero. Increase maxnodes again to 20 and take note of how well this model does compared to the simpler trees.

model <- randomForest(X, factor(y), mtry = 2, maxnodes = 20, ntree = 1)
pred <- predict(model, newdata = X)

df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)

df %>%
  ggplot(aes(x1, x2)) +
    geom_point(aes(color = pred)) +
    geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
    geom_abline(slope = 1, intercept = 0)

Copy the code one more time. Set maxnodes back to 2, but increase ntree to 3 and run the code a few times. You’ll see that sometimes it looks like a single split and sometimes it looks like two splits. What is going on?

model <- randomForest(X, factor(y), mtry = 2, maxnodes = 2, ntree = 3)
pred <- predict(model, newdata = X)

df <- tibble(y = y, x1 = X[,1], x2 = X[,2], pred = pred)

df %>%
  ggplot(aes(x1, x2)) +
    geom_point(aes(color = pred)) +
    geom_point(color = "black", size=0.2, data = filter(df, pred != y)) +
    geom_abline(slope = 1, intercept = 0)
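To see what is going on, you can ask for each tree’s vote separately with the `predict.all` option of `predict.randomForest`. A self-contained sketch (the simulation is regenerated so the chunk runs on its own): with three one-split trees, the forest reports the majority class, so the prediction regions are built from the overlap of the individual splits.

```r
library(randomForest)

# Regenerate the simulation so this chunk is self-contained
set.seed(1)
n <- 10000
X <- matrix(runif(n * 2), ncol = 2)
y <- as.numeric(X[,1] > X[,2])

model <- randomForest(X, factor(y), mtry = 2, maxnodes = 2, ntree = 3)

# predict.all = TRUE returns the per-tree votes ($individual, one column
# per tree) alongside the majority vote ($aggregate)
out <- predict(model, newdata = X, predict.all = TRUE)
head(out$individual)
head(out$aggregate)
```

Whenever two trees split on the same variable near the same point, the majority vote collapses to what looks like a single split; otherwise you see the two-split pattern.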

Flights dataset

The dataset for this lab is a collection of domestic flights within the U.S. during the year 2013. The full data contain quite a large number of flights (around 400k), but here we work with a 6,000-row sample with a limited number of columns. Some of the most important columns, however, are categorical with a large number of values.

flights <- read_csv("https://github.com/statsmaths/ml_data/blob/master/flights_small.csv?raw=true")
ocol <- ncol(flights)
flights
## # A tibble: 6,000 x 12
##    obs_id train_id delayed month   day weekday arr_hour dep_hour origin
##    <chr>  <chr>      <dbl> <dbl> <dbl>   <dbl>    <dbl>    <dbl> <chr> 
##  1 id_00… train          1     3    27       2       14       12 LAS   
##  2 id_00… valid          1     8    18       6       19       17 PHL   
##  3 id_00… train          0     3    14       3       21       20 BOS   
##  4 id_00… valid          1     8     9       4       12       11 BOS   
##  5 id_00… test          NA    11    25       7       22       21 ATL   
##  6 id_00… train          0     8     7       2       16       14 LAX   
##  7 id_00… valid          0     9    30       7       20       19 ATL   
##  8 id_00… test          NA    11    12       1       19       17 EWR   
##  9 id_00… test          NA    12     9       7        8        6 DTW   
## 10 id_00… valid          1    10     6       6       17       16 SAN   
## # … with 5,990 more rows, and 3 more variables: dest <chr>,
## #   distance <dbl>, carrier <chr>

Below I want you to use five different approaches to predict whether a flight will be delayed. Each time, produce (1) the classification rate on the training and validation sets and (2), where applicable, a summary of which variables are most important in the prediction model. Generally, just use all of the variables, though you’ll have a bit more freedom to experiment with which terms to smooth and interact together in the GAM. I have deliberately left some of the exact details unspecified. Play around with what seems to work best… also, make sure you are saving the script frequently in case something crashes!

  1. Fit an elastic net model using cross-validation:
X <- model.matrix( ~ -1 + ., data = flights[,seq(4, ocol)])
y <- flights$delayed

X_train <- X[flights$train_id == "train",]
y_train <- y[flights$train_id == "train"]

model <- cv.glmnet(X_train, y_train, family="binomial")

pred <- as.numeric(predict(model, newx = X) > 0)
tapply(flights$delayed == pred, flights$train_id, mean)
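One detail to keep in mind: `predict` on a binomial `cv.glmnet` model returns values on the link (log-odds) scale by default, so the cutoff at zero above is the same as a probability cutoff of one half. A base-R sketch of that equivalence (the `link` values are made up for illustration):

```r
# Log-odds above zero correspond to probabilities above one half
link <- c(-1.2, 0.4, 0.0, 2.1)
prob <- 1 / (1 + exp(-link))      # inverse logit
as.numeric(link > 0)              # threshold on the link scale
as.numeric(prob > 0.5)            # threshold on the probability scale
```

Both lines print `0 1 0 1`; alternatively, `type = "response"` in `predict` returns the probabilities directly.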

It will also be useful to see what the most important variables in the model are:

B <- coef(model, s = model$lambda[5])
B[B[,1] != 0,,drop=FALSE]
## 3 x 1 sparse Matrix of class "dgCMatrix"
##                       1
## (Intercept) -0.34811120
## arr_hour     0.01012931
## dep_hour     0.01432700
  2. Fit a KNN model:
X <- model.matrix( ~ -1 + ., data = flights[,seq(4, ocol)])
y <- flights$delayed

X_train <- X[flights$train_id == "train",]
y_train <- y[flights$train_id == "train"]
X_valid <- X[flights$train_id == "valid",]
y_valid <- y[flights$train_id == "valid"]

for (k in c(3, 10, 25, 50, 100))
{
  pred_valid <- FNN::knn(X_train, X_valid, y_train, k=k)
  print(mean(pred_valid == y_valid))
}
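Rather than reading the best value of k off the printed output, it can be convenient to store the rates and select the best one programmatically. A sketch of the bookkeeping (the `rates` values here are made up for illustration; in the lab they would come from the loop above):

```r
ks <- c(3, 10, 25, 50, 100)

# Illustrative validation rates, one per k (not real model output)
rates <- c(0.42, 0.38, 0.36, 0.37, 0.39)

ks[which.min(rates)]   # the k with the best validation performance
```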
  3. Fit a GAM:
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.8-23. For overview type 'help("mgcv-package")'.
model <- gam(delayed ~ origin + dest + s(dep_hour, arr_hour, distance),
             data = flights,
             family = binomial(),
             subset = train_id == "train")

pred <- as.numeric(predict(model, newdata = flights) > 0)

tapply(flights$delayed == pred, flights$train_id, mean)
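To see how flexible the fitted smooth ended up being, look at the effective degrees of freedom reported by `summary.gam`. A self-contained sketch on simulated data (the variables here are illustrative, not from the flights data):

```r
library(mgcv)

# Simulate a binary outcome that depends smoothly on x1 and linearly on x2
set.seed(1)
n <- 1000
x1 <- runif(n)
x2 <- runif(n)
y <- rbinom(n, 1, plogis(3 * sin(2 * pi * x1) + x2))

fit <- gam(y ~ s(x1) + x2, family = binomial())

# The edf column shows how wiggly the smooth is; an edf near 1 would
# mean the term is essentially linear
summary(fit)
```

Running `summary(model)` on the flights GAM gives the same kind of report for the three-variable smooth.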
  4. Fit a random forest model:
flights2 <- flights
flights2$origin <- factor(flights2$origin)
flights2$dest <- factor(flights2$dest)
flights2$carrier <- factor(flights2$carrier)

model <- randomForest(factor(delayed) ~ month + weekday + arr_hour + dep_hour + origin + dest + distance + carrier,
                      data = flights2,
                      subset = train_id == "train",
                      ntree = 100, maxnodes = 15)

pred <- as.numeric(predict(model, newdata = flights2)) - 1L

tapply(flights$delayed == pred, flights$train_id, mean)
##  test train valid 
##    NA 0.677 0.607
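The prompt also asks for the most important variables; for a random forest these come from `importance` (and `varImpPlot` for a chart). A self-contained sketch on simulated data, where `x1` drives the label and `x2` is pure noise, so `x1` should dominate:

```r
library(randomForest)

set.seed(1)
n <- 2000
dat <- data.frame(x1 = runif(n), x2 = runif(n))
y <- factor(as.numeric(dat$x1 > 0.5))

fit <- randomForest(dat, y, ntree = 100)

# Mean decrease in Gini impurity for each predictor; larger means more
# important, so x1 should be far ahead of the noise variable x2
importance(fit)
```

In the lab, call `importance(model)` or `varImpPlot(model)` on the fitted flights model instead.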
  5. Fit a gradient boosted tree:
X <- model.matrix( ~ -1 + ., data = flights[,seq(4, ncol(flights))])
y <- flights$delayed

X_train <- X[flights$train_id == "train",]
y_train <- y[flights$train_id == "train"]
X_valid <- X[flights$train_id == "valid",]
y_valid <- y[flights$train_id == "valid"]

data_train <- xgb.DMatrix(data = X_train, label = y_train)
data_valid <- xgb.DMatrix(data = X_valid, label = y_valid)

watchlist <- list(train=data_train, valid=data_valid)

model <- xgb.train(data = data_train,
                   max_depth = 5, eta = 0.003, nthread = 2,
                   nrounds = 100,
                   objective = "binary:logistic",
                   watchlist = watchlist,
                   verbose=1, print_every_n=10, early_stopping_rounds = 100)
## [1]  train-error:0.371000    valid-error:0.430500 
## Multiple eval metrics are present. Will use valid_error for early stopping.
## Will train until valid_error hasn't improved in 100 rounds.
## 
## [11] train-error:0.356000    valid-error:0.443500 
## [21] train-error:0.357000    valid-error:0.443500 
## [31] train-error:0.356500    valid-error:0.446000 
## [41] train-error:0.355000    valid-error:0.447500 
## [51] train-error:0.355000    valid-error:0.446500 
## [61] train-error:0.354500    valid-error:0.446500 
## [71] train-error:0.354500    valid-error:0.446500 
## [81] train-error:0.358000    valid-error:0.447500 
## [91] train-error:0.357500    valid-error:0.445000 
## [100]    train-error:0.360500    valid-error:0.434500
importance_matrix <- xgb.importance(model = model)
importance_matrix
##       Feature         Gain        Cover    Frequency
##  1:  dep_hour 0.3169062814 2.222412e-01 0.1271911945
##  2:  distance 0.1539465537 1.149334e-01 0.1630656339
##  3:       day 0.0601610199 5.722906e-02 0.0835711374
##  4:   destDEN 0.0597558511 1.008294e-01 0.0668569099
##  5:  arr_hour 0.0564554148 8.910177e-02 0.0835711374
##  6:     month 0.0446229794 5.413137e-02 0.0766408479
##  7:   weekday 0.0394430501 1.847148e-02 0.0986547085
##  8:   destCLT 0.0379445393 5.168204e-02 0.0338361190
##  9:   destEWR 0.0317800262 6.194631e-02 0.0326131268
## 10: carrierAA 0.0309772697 4.546447e-03 0.0297594782
## 11:   destMDW 0.0246821284 3.607814e-02 0.0260905014
## 12: originJFK 0.0186330577 1.712325e-02 0.0146759071
## 13: carrierXE 0.0180173076 2.710544e-02 0.0183448838
## 14: originLAX 0.0173382444 3.954304e-02 0.0150835711
## 15: carrierOH 0.0164748289 2.819657e-02 0.0187525479
## 16:   destMSP 0.0164098304 2.823096e-02 0.0146759071
## 17: carrierUS 0.0152987935 1.000503e-02 0.0240521810
## 18:   destORD 0.0099107499 1.755060e-02 0.0110069303
## 19: originSAN 0.0083819792 2.445547e-03 0.0118222585
## 20:   destJFK 0.0067003938 7.079628e-03 0.0126375866
## 21:   destSFO 0.0039798995 1.469421e-03 0.0085609458
## 22: originCLT 0.0029076657 1.580220e-03 0.0048919690
## 23: originLGA 0.0027392225 4.492030e-03 0.0048919690
## 24: originSFO 0.0026395247 1.791410e-03 0.0069302894
## 25: originATL 0.0013354590 7.010750e-04 0.0081532817
## 26: carrierUA 0.0012084412 8.488053e-04 0.0016306563
## 27:   destSEA 0.0008175935 5.648849e-04 0.0008153282
## 28: originCVG 0.0005318946 8.144905e-05 0.0012229923
##       Feature         Gain        Cover    Frequency
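If the table is hard to scan, the same information can be drawn with `xgb.plot.importance`. A self-contained sketch on simulated data (in the lab you would pass the `importance_matrix` from the chunk above instead):

```r
library(xgboost)

# Simulate data where only the first column matters
set.seed(1)
X <- matrix(runif(2000), ncol = 2)
colnames(X) <- c("x1", "x2")
y <- as.numeric(X[, 1] > 0.5)

fit <- xgboost(data = X, label = y, nrounds = 20,
               objective = "binary:logistic", verbose = 0)

imp <- xgb.importance(model = fit)
xgb.plot.importance(imp)   # bar chart, ordered by Gain
```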

Best solution

Now, as with last time, build the best possible model you can to predict the output. Feel free to make use of the validation set to fit the final model (this is certainly not a requirement, just a suggestion).
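If you do fold the validation set into the final fit, the subsetting is just a matter of selecting both labels from `train_id`. A sketch of the logic on a small stand-in vector (in the lab, `flights$train_id` plays this role):

```r
# Select rows belonging to either the training or validation split
train_id <- c("train", "valid", "test", "train", "valid")
train_valid <- train_id %in% c("train", "valid")
train_valid   # TRUE for every non-test row
```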

Submitting your solutions

Finally, once you have your predictions saved as a variable called pred, run the following code to produce your results:

submit <- select(flights, obs_id, train_id)
submit <- mutate(submit, pred = pred)
write_csv(submit, "class13_submit.csv")

Now, upload the csv file to GitHub and you are done.